This section starts with a Project that already contains alignments:


In [1]:
from reprophylo import *
pj = unpickle_pj('./outputs/my_project.pkpj',
                 git=False)


DEBUG:Cloud:Log file (/home/amir/.picloud/cloud.log) opened

If we call the keys of the pj.alignments dictionary, we can see the names of the alignments it contains:


In [2]:
pj.alignments.keys()


Out[2]:
['28s@muscleDefault', 'MT-CO1@mafftLinsi', '18s@muscleDefault']

3.7.1 Configuring an alignment trimming process

Like the sequence alignment phase, alignment trimming has its own configuration class, the TrimalConf class. An object of this class will generate a command-line and the required input files for the program TrimAl, but will not execute the process (this is shown below). Once the process has been successfully executed, this TrimalConf object is also stored in pj.used_methods and it can be invoked as a report.

3.7.1.1 Example1, the default gappyput algorithm

With TrimalConf, instead of specifying loci names, we provide alignment names, as they appear in the keys of pj.alignments


In [3]:
gappyout = TrimalConf(pj,                                # The Project
                    
                      method_name='gappyout',            # Any unique string ('gappyout' is default)
                    
                      program_name='trimal',             # No alternatives in this ReproPhylo version
                    
                      cmd='default',                     # the default is trimal. Change it here
                                                         # or in pj.defaults['trimal']
                    
                      alns=['MT-CO1@mafftLinsi'],        # 'all' by default
                    
                      trimal_commands={'gappyout': True} # By default, the gappyout algorithm is used.
                      )


trimal -in 587711440759152.37_MT-CO1@mafftLinsi.fasta -gappyout

3.7.1.2 List comprehension to subset alignments

In this example, it is easy enough to copy and paste alignment names into a list and pass it to TrimalConf. But this is more difficult if we want to fish out a subset of alignments from a very large list of alignments. In such cases, Python's list comprehension is very useful. Below I show two uses of list comprehension, but the more you feel comfortable with this approach, the better.

Getting locus names of rRNA loci
If you read the code line that follows very carefully, you will see it quite literally says "take the name of each Locus found in pj.loci if its feature type is rRNA, and put it in a list":


In [4]:
rRNA_locus_names = [locus.name for locus in pj.loci if locus.feature_type == 'rRNA']
print rRNA_locus_names


['18s', '28s']

what we get is a list of names of our rRNA loci.

Getting alignment names that have locus names of rRNA loci
The following line says: "take the key of each alignment from the pj.alignments dictionary if the first word before the '@' symbol is in the list of rRNA locus names, and put this key in a list":


In [5]:
rRNA_alignment_names = [key for key in pj.alignments.keys() if key.split('@')[0] in rRNA_locus_names]
print rRNA_alignment_names


['28s@muscleDefault', '18s@muscleDefault']

We get a list of keys, of the rRNA loci alignments we produced on the previous section, and which are stored in the pj.alignments dictionary. We can now pass this list to a new TrimalConf instance that will only process rRNA locus alignments:


In [6]:
gt50 = TrimalConf(pj,
                  method_name='gt50',
                  alns = rRNA_alignment_names,
                  trimal_commands={'gt': 0.5}  # This will keep positions with up to
                                               # 50% gaps.
                  )


trimal -in 915841440759159.29_28s@muscleDefault.fasta -gt 0.5
trimal -in 915841440759159.29_18s@muscleDefault.fasta -gt 0.5

3.7.2 Executing the alignment trimming process

As for the alignment phase, this is done with a Project method, which accepts a list of TrimalConf objects.


In [7]:
pj.trim([gappyout, gt50])

Once used, these objects are also placed in the pj.used_methods dictionary, and they can be printed out for observation:


In [8]:
print pj.used_methods['gappyout']


TrimalConf named gappyout with ID 587711440759152.37
Alignments: MT-CO1@mafftLinsi 
Created on: Fri Aug 28 11:52:32 2015
Commands:
MT-CO1@mafftLinsi@gappyout: trimal -in 587711440759152.37_MT-CO1@mafftLinsi.fasta -gappyout

Environment:Platform: Linux-3.13.0-40-generic-x86_64-with-Ubuntu-14.04-trusty
 Processor: x86_64
 Python build: defaultJun 22 2015 17:58:13
 Python compiler: GCC 4.8.2
 Python implementation: CPython
 Python version: 2.7.6
 ete2 version: 2.2rev1056
 biopython version: 1.64
 dendropy version: 3.12.0
 cloud version: 2.8.5
 reprophylo version 1.0
 User: amir-TECRA-W50-A
 Program and version: trimAl 1.2rev59
 Program reference: Salvador Capella-Gutierrez; Jose M. Silla-Martinez; Toni Gabaldon. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 2009 25: 1972-1973.
execution time:
0.478782892227

==============================
Core Methods section sentence:
==============================
The alignment(s) MT-CO1@mafftLinsi were trimmed using the program trimAl 1.2rev59 [1].

Reference:
Salvador Capella-Gutierrez; Jose M. Silla-Martinez; Toni Gabaldon. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 2009 25: 1972-1973.

3.7.3 Accessing trimmed sequence alignments

3.7.3.1 The pj.trimmed_alignments dictionary

The trimmed alignments themselves are stored in the pj.trimmed_alignments dictionary, using keys that follow this pattern: locus_name@alignment_method_name@trimming_method_name where alignment_method_name is the name you have provided to your AlnConf object and trimming_method_name is the one you provided to your TrimalConf object.


In [9]:
pj.trimmed_alignments


Out[9]:
{'18s@muscleDefault@gt50': <<class 'Bio.Align.MultipleSeqAlignment'> instance (52 records of length 1685, IUPACAmbiguousDNA()) at 7fe542480510>,
 '28s@muscleDefault@gt50': <<class 'Bio.Align.MultipleSeqAlignment'> instance (48 records of length 798, IUPACAmbiguousDNA()) at 7fe542480550>,
 'MT-CO1@mafftLinsi@gappyout': <<class 'Bio.Align.MultipleSeqAlignment'> instance (73 records of length 1107, IUPACAmbiguousDNA()) at 7fe5424d6dd0>}

3.7.3.2 Accessing a MultipleSeqAlignment object

A trimmed alignment can be easily accessed and manipulated with any of Biopython's AlignIO tricks using the fta Project method:


In [10]:
print pj.fta('18s@muscleDefault@gt50')[:4,410:420].format('phylip-relaxed')


returning trimmed alignment object 18s@muscleDefault@gt50
 4 10
KC762720.1_f0  CCAATCCCGA 
KC774024.1_f0  CCAATCCCGA 
KC762713.1_f0  CCAATCGGGA 
KC762708.1_f0  CCAATCCCGA 

3.7.3.3 Writing trimmed sequence alignment files

Trimmed alignment text files can be dumped in any AlignIO format for usage in an external command line or GUI program. When writing to files, you can control the header of the sequence by, for example, adding the organism name of the gene name, or by replacing the feature ID with the record ID:


In [11]:
# record_id and source_organism are feature qualifiers in the SeqRecord object
# See section 3.4
files = pj.write_trimmed_alns(id=['record_id','source_organism'],
                                   format='fasta')
files


Out[11]:
['28s@muscleDefault@gt50_trimmed_aln.fasta',
 '18s@muscleDefault@gt50_trimmed_aln.fasta',
 'MT-CO1@mafftLinsi@gappyout_trimmed_aln.fasta']

The files will always be written to the current working directory (where this notebook file is), and can immediately be moved programmatically to avoid clutter:


In [12]:
# make a new directory for your trimmed alignment files:
if not os.path.exists('trimmed_alignment_files'):
    os.mkdir('trimmed_alignment_files')
    
# move the files there
for f in files:
    os.rename(f, "./trimmed_alignment_files/%s"%f)

3.7.3.4 Viewing trimmed alignments

Trimmed alignments can be viewed in the same way as alignments, but using this command:


In [13]:
pj.show_aln('MT-CO1@mafftLinsi@gappyout',id=['source_organism'])

In [15]:
pickle_pj(pj, 'outputs/my_project.pkpj')


Out[15]:
'outputs/my_project.pkpj'

3.7.4 Quick reference


In [ ]:
# Make a TrimalConf object
trimconf = TrimalConf(pj, **kwargs)

# Execute alignment process
pj.trim([trimconf])

# Show AlnConf description
print pj.used_methods['method_name']

# Fetch a MultipleSeqAlignment object
trim_aln_obj = pj.fta('locus_name@aln_method_name@trim_method_name')

# Write alignment text files
pj.write_trimmed_alns(id=['some_feature_qualifier'], format='fasta')
# the default feature qualifier is 'feature_id'
# 'fasta' is the default format

# View alignment in browser
pj.show_aln('locus_name@aln_method_name@trim_method_name',id=['some_feature_qualifier'])

In [ ]: